Client Report - The War with Star Wars

Course DS 250

Author

Hugo Coronado

Show the code
import pandas as pd 
import numpy as np
import sklearn
from lets_plot import *
LetsPlot.setup_html(isolated_frame=True)

print("Pandas:", pd.__version__)
print("NumPy:", np.__version__)
print("sklearn:", sklearn.__version__)

url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/star-wars-survey/StarWars.csv"
df = pd.read_csv(url, encoding="ISO-8859-1")

df.head()  
Pandas: 2.3.2
NumPy: 2.3.3
sklearn: 1.7.2
RespondentID Have you seen any of the 6 films in the Star Wars franchise? Do you consider yourself to be a fan of the Star Wars film franchise? Which of the following Star Wars films have you seen? Please select all that apply. Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film. ... Unnamed: 28 Which character shot first? Are you familiar with the Expanded Universe? Do you consider yourself to be a fan of the Expanded Universe?ξ Do you consider yourself to be a fan of the Star Trek franchise? Gender Age Household Income Education Location (Census Region)
0 NaN Response Response Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi Star Wars: Episode I The Phantom Menace ... Yoda Response Response Response Response Response Response Response Response Response
1 3.292880e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3 ... Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 No NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1 ... Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 ... Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

5 rows × 38 columns

Elevator pitch

A SHORT (2-3 SENTENCES) PARAGRAPH THAT DESCRIBES KEY INSIGHTS TAKEN FROM METRICS IN THE PROJECT RESULTS THINK TOP OR MOST IMPORTANT RESULTS. (Note: this is not a summary of the project, but a summary of the results.)

A Client has requested this analysis and this is your one shot of what you would say to your boss in a 2 min elevator ride before he takes your report and hands it to the client.

QUESTION|TASK 1

Shorten the column names and clean them up for easier use with pandas. Provide a table or list that exemplifies how you fixed the names.

The original dataset contained long survey question text and many “Unnamed” auto-generated column labels. I cleaned the column names using a mapping dictionary to shorten the labels and make them easier to use for modeling. Below is a sample of the renaming applied:

Original Column Name | Clean Name
RespondentID | respondent_id
Have you seen any of the 6 films… | seen_any
Do you consider yourself to be a fan of the Star Wars film franchise? | fan_starwars
Which of the following Star Wars films have you seen…, Unnamed: 4 → Unnamed: 8 | seen_ep1 → seen_ep6
Please rank the Star Wars films…, Unnamed: 10 → Unnamed: 14 | rank_ep1 → rank_ep6
Please state whether you view the following characters…, Unnamed: 16 → Unnamed: 28 | char_han → char_yoda
Gender | gender
Age | age_range
Household Income | income_range
Education | education
Location (Census Region) | location
Show the code
# Strip stray non-ASCII characters from the header. One survey column ends in
# encoding debris, which would otherwise keep its key below from matching.
df.columns = df.columns.str.replace(r"[^\x00-\x7F]+", "", regex=True).str.strip()

# New names follow the sub-labels stored in row 0 of the raw file
# (film titles for the "seen" columns, character names for the "char" columns).
rename_map = {
    "RespondentID": "respondent_id",
    "Have you seen any of the 6 films in the Star Wars franchise?": "seen_any",
    "Do you consider yourself to be a fan of the Star Wars film franchise?": "fan_starwars",
    # The question column itself holds the Episode I answers
    "Which of the following Star Wars films have you seen? Please select all that apply.": "seen_ep1",
    "Unnamed: 4": "seen_ep2",
    "Unnamed: 5": "seen_ep3",
    "Unnamed: 6": "seen_ep4",
    "Unnamed: 7": "seen_ep5",
    "Unnamed: 8": "seen_ep6",
    "Please rank the Star Wars films in order of preference with 1 being your favorite film in the franchise and 6 being your least favorite film.": "rank_ep1",
    "Unnamed: 10": "rank_ep2",
    "Unnamed: 11": "rank_ep3",
    "Unnamed: 12": "rank_ep4",
    "Unnamed: 13": "rank_ep5",
    "Unnamed: 14": "rank_ep6",
    "Please state whether you view the following characters favorably, unfavorably, or are unfamiliar with him/her.": "char_han",
    "Unnamed: 16": "char_luke",
    "Unnamed: 17": "char_leia",
    "Unnamed: 18": "char_anakin",
    "Unnamed: 19": "char_obiwan",
    "Unnamed: 20": "char_palpatine",
    "Unnamed: 21": "char_vader",
    "Unnamed: 22": "char_lando",
    "Unnamed: 23": "char_boba",
    "Unnamed: 24": "char_c3po",
    "Unnamed: 25": "char_r2d2",
    "Unnamed: 26": "char_jarjar",
    "Unnamed: 27": "char_padme",
    "Unnamed: 28": "char_yoda",
    "Which character shot first?": "shot_first",
    "Are you familiar with the Expanded Universe?": "know_eu",
    "Do you consider yourself to be a fan of the Expanded Universe?": "fan_eu",
    "Do you consider yourself to be a fan of the Star Trek franchise?": "fan_startrek",
    "Gender": "gender",
    "Age": "age_range",
    "Household Income": "income_range",
    "Education": "education",
    "Location (Census Region)": "location"
}

df = df.rename(columns=rename_map)

df.head()
respondent_id seen_any fan_starwars seen_ep1 seen_ep2 seen_ep3 seen_ep4 seen_ep5 seen_ep6 rank_ep1 ... char_yoda shot_first know_eu fan_eu fan_startrek gender age_range income_range education location
0 NaN Response Response Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi Star Wars: Episode I The Phantom Menace ... Yoda Response Response Response Response Response Response Response Response Response
1 3.292880e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 3 ... Very favorably I don't understand this question Yes No No Male 18-29 NaN High school degree South Atlantic
2 3.292880e+09 No NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN Yes Male 18-29 $0 - $24,999 Bachelor degree West South Central
3 3.292765e+09 Yes No Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith NaN NaN NaN 1 ... Unfamiliar (N/A) I don't understand this question No NaN No Male 18-29 $0 - $24,999 High school degree West North Central
4 3.292763e+09 Yes Yes Star Wars: Episode I The Phantom Menace Star Wars: Episode II Attack of the Clones Star Wars: Episode III Revenge of the Sith Star Wars: Episode IV A New Hope Star Wars: Episode V The Empire Strikes Back Star Wars: Episode VI Return of the Jedi 5 ... Very favorably I don't understand this question No NaN Yes Male 18-29 $100,000 - $149,999 Some college or Associate degree West North Central

5 rows × 38 columns
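
With its default settings, DataFrame.rename skips mapping keys that match no column and raises no error, so a key that differs from the real header by even one stray character goes unnoticed. The sketch below shows one way to catch that on a toy frame; the column names and the mapping are hypothetical stand-ins, not the survey's real header.

```python
import pandas as pd

# Toy frame: the second column name carries a stray trailing character,
# standing in for encoding debris in a survey header (hypothetical names).
toy = pd.DataFrame(columns=["Gender", "Fan of the Expanded Universe?\u00e6"])

# Hypothetical mapping whose second key lacks the stray character.
toy_map = {"Gender": "gender", "Fan of the Expanded Universe?": "fan_eu"}

# Keys that match no column would be skipped by rename() without any error.
unmatched = [key for key in toy_map if key not in toy.columns]
print(unmatched)

# One possible remedy: strip non-ASCII characters from the header first.
toy.columns = toy.columns.str.replace(r"[^\x00-\x7F]+", "", regex=True).str.strip()
renamed = toy.rename(columns=toy_map)
print(list(renamed.columns))
```

Running a check like this right after building the mapping makes a silent miss visible before any later step relies on the short names.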

QUESTION|TASK 2

Clean and format the data so that it can be used in a machine learning model. As you format the data, you should complete each item listed below. In your final report provide example(s) of the reformatted data with a short description of the changes made.
a. Filter the dataset to respondents that have seen at least one film
b. Create a new column that converts the age ranges to a single number. Drop the age range categorical column
c. Create a new column that converts the education groupings to a single number. Drop the school categorical column
d. Create a new column that converts the income ranges to a single number. Drop the income range categorical column
e. Create your target (also known as “y” or “label”) column based on the new income range column
f. One-hot encode all remaining categorical columns

Step 2.1 — Filter to respondents who have seen at least one film

To prepare for prediction, we keep only respondents who answered “Yes” to the question about having seen any Star Wars film. This removes respondents who answered “No”, whose film rankings and character opinions are mostly missing, and also the first row of the raw file (index 0), which holds sub-labels such as “Response” rather than an actual answer. Restricting the data this way improves its quality for modeling.

Step 2.2 — Convert age ranges to numeric values

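The age_range column uses text groups such as “18-29”.
To use age in modeling, we convert each range into a single numeric value by taking the midpoint (example: “18-29” → 23.5). Then we drop the original text column. Any label that is not of the form “low-high”, such as an open-ended top bracket, does not match the pattern and becomes NaN.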

Show the code
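# Step 2.1
# .copy() makes the filtered frame independent, so the column assignments
# in later steps do not trigger pandas' SettingWithCopyWarning
df = df[df['seen_any'] == 'Yes'].copy()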

df.shape
(936, 38)
Show the code
# Step 2.2

df['age_mid'] = df['age_range'].str.extract(r'(\d+)-(\d+)').astype(float).mean(axis=1)
df = df.drop(columns=['age_range'])

df[['age_mid']].head()
age_mid
1 23.5
3 23.5
4 23.5
5 23.5
6 23.5
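
The regular expression in Step 2.2 only matches labels with two bounds. The sketch below shows one way an open-ended bracket could be given a value as well. It assumes the top bracket is labeled "> 60", and the stand-in value of 65 is an arbitrary assumption, not something taken from the survey.

```python
import pandas as pd

# Toy labels; "> 60" is the assumed spelling of the open-ended bracket.
ages = pd.Series(["18-29", "30-44", "45-60", "> 60", None])

# Closed ranges: midpoint of the two bounds, as in Step 2.2.
bounds = ages.str.extract(r"(\d+)-(\d+)").astype(float)
age_mid = bounds.mean(axis=1)

# Open-ended bracket: ASSUMPTION, 65 is an arbitrary representative value.
OVER_60_VALUE = 65.0
age_mid = age_mid.mask(ages.isin(["> 60"]), OVER_60_VALUE)

print(age_mid.tolist())
```

Missing answers stay NaN, so the later decision about dropping or imputing them remains explicit.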

Step 2.3 — Convert education levels to numeric values

The education column stores categories such as “High school degree” and “Bachelor degree.”
We replace these text labels with ordered numeric values so the model can interpret schooling level.

Show the code
# Step 2.3 — Convert education levels to numeric scale

education_map = {
"Less than high school degree": 1,
"High school degree": 2,
"Some college or Associate degree": 3,
"Bachelor degree": 4,
"Graduate degree": 5
}

df['edu_num'] = df['education'].map(education_map)
df = df.drop(columns=['education'])

df[['edu_num']].head()
edu_num
1 2.0
3 2.0
4 3.0
5 3.0
6 4.0

Step 2.4 — Convert income ranges to numeric values

The income_range column uses dollar ranges like “$50,000 - $99,999.”
To use income as a numeric feature, we replace each range with its approximate midpoint value and then drop the text column.

Show the code
# Step 2.4 — Convert income ranges to numeric midpoints

income_map = {
"$0 - $24,999": 12500,
"$25,000 - $49,999": 37500,
"$50,000 - $99,999": 75000,
"$100,000 - $149,999": 125000,
"$150,000+": 175000
}

df['income_mid'] = df['income_range'].map(income_map)
df = df.drop(columns=['income_range'])

df[['income_mid']].head()
income_mid
1 NaN
3 12500.0
4 125000.0
5 125000.0
6 37500.0
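
The values in income_map are rounded midpoints. As an illustration of the arithmetic, the sketch below derives a midpoint from the label text itself. The helper name bracket_midpoint and its open_ended_value parameter are hypothetical, and the 175000 stand-in for the top bracket simply mirrors the hand-written map.

```python
import re

def bracket_midpoint(label, open_ended_value=175000.0):
    """Return a midpoint for labels like '$25,000 - $49,999'; NaN if missing."""
    if not isinstance(label, str):
        return float("nan")
    # Pull out every number, dropping the thousands separators.
    numbers = [float(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", label)]
    if len(numbers) == 2:
        return sum(numbers) / 2          # exact midpoint of the two bounds
    if len(numbers) == 1:                # open-ended, e.g. '$150,000+'
        return open_ended_value          # ASSUMPTION: same stand-in as income_map
    return float("nan")

print(bracket_midpoint("$25,000 - $49,999"))
print(bracket_midpoint("$150,000+"))
```

The exact midpoint of $25,000 and $49,999 is 37,499.5, which the hand-written map rounds to 37,500; the difference has no effect on a $50,000 threshold.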

Step 2.5 — Create the target column for prediction

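Our machine learning model will predict whether someone earns more than $50,000 per year.
We create a binary target column where 1 = income midpoint above $50,000 (the “$50,000 - $99,999” bracket and higher) and 0 = otherwise. A missing income compares as False and is therefore also labeled 0, as row 1 in the output below shows; these rows are removed later by dropna() in Task 4.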

Show the code
# Step 2.5 — Create target column: high_income

df['high_income'] = (df['income_mid'] > 50000).astype(int)

df[['income_mid', 'high_income']].head()
income_mid high_income
1 NaN 0
3 12500.0 0
4 125000.0 1
5 125000.0 1
6 37500.0 0
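
In the output above, row 1 has no income yet receives the label 0, because a comparison against NaN evaluates to False. A minimal sketch of one alternative, on toy data, is to drop rows with unknown income before building the label:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"income_mid": [np.nan, 12500.0, 125000.0, 37500.0, 75000.0]})

# NaN > 50000 evaluates to False, so a missing income would be labeled 0.
naive = (toy["income_mid"] > 50000).astype(int)

# Keep only rows with a known income, then build the label.
known = toy.dropna(subset=["income_mid"]).copy()
known["high_income"] = (known["income_mid"] > 50000).astype(int)

print(naive.tolist())
print(known["high_income"].tolist())
```

With this ordering the label is never defined for a respondent whose income is unknown.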

Step 2.6 — One-hot encode remaining categorical variables

Machine learning algorithms require numeric input.
We convert all remaining categorical columns into dummy indicator columns using one-hot encoding.
This produces our final modeling dataset.

Show the code
# Step 2.6 — One-hot encode remaining categorical columns
categorical_cols = df.select_dtypes(include='object').columns


df_ml = pd.get_dummies(df, columns=categorical_cols, drop_first=True)


df_ml.head()
respondent_id age_mid edu_num income_mid high_income fan_starwars_Yes rank_ep1_2 rank_ep1_3 rank_ep1_4 rank_ep1_5 ... fan_startrek_Yes gender_Male location_East South Central location_Middle Atlantic location_Mountain location_New England location_Pacific location_South Atlantic location_West North Central location_West South Central
1 3.292880e+09 23.5 2.0 NaN 0 True False True False False ... False True False False False False False True False False
3 3.292765e+09 23.5 2.0 12500.0 0 False False False False False ... False True False False False False False False True False
4 3.292763e+09 23.5 3.0 125000.0 1 True False False False True ... True True False False False False False False True False
5 3.292731e+09 23.5 3.0 125000.0 1 True False False False True ... False True False False False False False False True False
6 3.292719e+09 23.5 4.0 37500.0 0 True False False False False ... True True False True False False False False False False

5 rows × 120 columns
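
One side effect of drop_first=True is worth checking: a column whose only non-missing value is a single label (for example, a "seen this film" column that holds either the film title or NaN) has just one category, so dropping the first category removes the column entirely. The toy sketch below illustrates this, along with one possible alternative that keeps the information as a 0/1 flag. The column names are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "seen_ep1": ["Star Wars: Episode I The Phantom Menace", np.nan,
                 "Star Wars: Episode I The Phantom Menace"],
    "gender": ["Male", "Female", "Male"],
})

# The single-category column disappears; only the gender dummy remains.
encoded = pd.get_dummies(toy, columns=["seen_ep1", "gender"], drop_first=True)
print(list(encoded.columns))

# Alternative: turn the single-answer column into an explicit 0/1 flag first.
toy["seen_ep1"] = toy["seen_ep1"].notna().astype(int)
encoded_flag = pd.get_dummies(toy, columns=["gender"], drop_first=True)
print(list(encoded_flag.columns))
```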

QUESTION|TASK 3

Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article.

Visual #1 — Average ranking of Star Wars movies

FiveThirtyEight reported that Episode V: The Empire Strikes Back is the most liked movie overall.
To validate this, we compute the average ranking for each episode across all respondents in the filtered dataset (those who reported having seen at least one film).
Lower ranking numbers mean higher preference.

Show the code
# Visual #1 — Average ranking of Star Wars movies

from lets_plot import ggplot, aes, ggtitle, theme, element_text, geom_bar

ranking_cols = ['rank_ep1', 'rank_ep2', 'rank_ep3', 'rank_ep4', 'rank_ep5', 'rank_ep6']

df[ranking_cols] = df[ranking_cols].apply(pd.to_numeric, errors='coerce')

avg_rankings = df[ranking_cols].mean().reset_index()
avg_rankings.columns = ['episode', 'avg_rank']

avg_rankings['episode'] = avg_rankings['episode'].replace({
    'rank_ep1': 'Episode I',
    'rank_ep2': 'Episode II',
    'rank_ep3': 'Episode III',
    'rank_ep4': 'Episode IV',
    'rank_ep5': 'Episode V',
    'rank_ep6': 'Episode VI'
})

ggplot(avg_rankings, aes(x='episode', y='avg_rank')) + \
    geom_bar(stat='identity') + \
    ggtitle("Average Ranking of Star Wars Movies (Lower = Better)") + \
    theme(axis_text_x=element_text(angle=45, hjust=1))
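
If the article's ranking chart is restricted to respondents who have seen every film (an assumption worth checking against the article), the averages could be computed on that subset instead. The sketch below shows the idea on toy data with two films and hypothetical 0/1 "seen" flags.

```python
import pandas as pd

# Toy data: 0/1 flags for two films and rankings stored as text.
toy = pd.DataFrame({
    "seen_ep1": [1, 1, 0, 1],
    "seen_ep2": [1, 0, 0, 1],
    "rank_ep1": ["2", "1", None, "2"],
    "rank_ep2": ["1", "2", None, "1"],
})
seen_cols = ["seen_ep1", "seen_ep2"]
rank_cols = ["rank_ep1", "rank_ep2"]

# Keep only respondents who have seen every film in the toy set.
saw_all = toy[seen_cols].eq(1).all(axis=1)
ranks = toy.loc[saw_all, rank_cols].apply(pd.to_numeric, errors="coerce")

# Lower average rank means the film is preferred.
avg_rank = ranks.mean().sort_values()
print(avg_rank)
```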

Visual #2 — “Who shot first?” responses

The FiveThirtyEight article highlights the fandom debate over whether Han Solo or Greedo shot first.
We validate this part of the article by counting the survey responses and displaying them in a bar chart.

Show the code
# Visual #2 — "Who shot first?" responses

from lets_plot import ggplot, aes, ggtitle, theme, element_text, geom_bar

shot_counts = df['shot_first'].value_counts(dropna=False).reset_index()
shot_counts.columns = ['answer', 'count']

ggplot(shot_counts, aes(x='answer', y='count')) + \
    geom_bar(stat='identity') + \
    ggtitle("Who Shot First? — Survey Responses") + \
    theme(axis_text_x=element_text(angle=45, hjust=1))
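
Raw counts are harder to compare with a published chart than shares. As a small sketch on toy answers (not the survey data), value_counts(normalize=True) gives the share of each answer among respondents who answered the question:

```python
import pandas as pd

# Toy answers; None stands for a respondent who skipped the question.
answers = pd.Series(["Han", "Greedo", "Han", "I don't understand this question", None])

# Missing answers are excluded by default, so shares sum to 100.
share = (answers.value_counts(normalize=True) * 100).round(1)
print(share)
```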

QUESTION|TASK 4

Build a machine learning model that predicts whether a person makes more than $50k. Describe your model and report the accuracy.

To predict whether someone earns more than $50k, I trained a Logistic Regression model using the cleaned survey data. The dataset was split into a training set (80%) and test set (20%). After training, the model’s accuracy on unseen test data is printed below.

Show the code
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


df_model = df_ml.dropna()

X = df_model.drop(columns=['high_income'])
y = df_model['high_income']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
accuracy
1.0

The logistic regression model scored 100% accuracy on the held-out test set, but this result does not demonstrate real predictive power. The feature matrix X still contains income_mid, the column from which high_income was derived, so the model only has to learn the $50,000 threshold on that one feature (target leakage). respondent_id, an arbitrary identifier, is also still among the features. A meaningful accuracy estimate requires dropping both columns from X and retraining; until that is done, this report cannot conclude that demographic and fandom-related responses predict income level.
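
The feature matrix built above drops only high_income, so income_mid (the column the target was derived from) and respondent_id remain among the predictors. The sketch below shows, on synthetic data, how the split and fit might look with those columns excluded. The toy frame is randomly generated and its accuracy says nothing about the survey data, so no expected value is given.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned survey frame (values are random).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "respondent_id": np.arange(n),
    "age_mid": rng.choice([23.5, 37.0, 52.5], size=n),
    "edu_num": rng.integers(1, 6, size=n),
    "income_mid": rng.choice([12500.0, 37500.0, 75000.0, 125000.0], size=n),
})
toy["high_income"] = (toy["income_mid"] > 50000).astype(int)

# Exclude the target, the column it was derived from, and the identifier.
excluded = ["high_income", "income_mid", "respondent_id"]
X = toy.drop(columns=excluded)
y = toy["high_income"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy without the income feature: {accuracy:.2f}")
```

Applied to df_ml, the same idea would mean adding 'income_mid' and 'respondent_id' to the list passed to drop() before the split.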